ABSTRACT

Lung cancer is the most common cause of cancer-related mortality,
with more than 1.4 million deaths per year worldwide. To search for
significant somatic alterations in lung cancer, we analyzed,
integrated and manually curated various data sets and published
reports to present an integrated genomic database of non-small cell
lung cancer (IGDB.NSCLC, http://igdb.nsclc.ibms.sinica.edu.tw).
We collected data sets derived from hundreds of human NSCLC samples
(lung adenocarcinomas and/or squamous cell carcinomas) to illustrate
genomic alterations [chromosomal regions with copy number
alterations (CNAs), gain/loss and loss of heterozygosity],
aberrantly expressed genes and microRNAs, somatic mutations, and
experimental evidence and clinical information on alterations
retrieved from the literature. IGDB.NSCLC provides user-friendly
interfaces and search functions that display multiple layers of
evidence, with particular emphasis on concordant alterations: CNAs
with co-localized altered gene expression, aberrant microRNA
expression, somatic mutations or genes associated with
clinicopathological features. These significant concordant
alterations in NSCLC are presented graphically or in tables so that
they can be prioritized as putative cancer targets for pathological
and mechanistic studies of lung tumorigenesis and for developing new
strategies in clinical interventions.



INTRODUCTION 

Cancer is the leading killer of human beings with more 
than 7.4 million deaths worldwide per year. Among them, 
lung cancer is the most common cause of cancer-related 
mortality in both men and women with over 1.4 million 
deaths annually (1). Lung cancer can be divided into two 
categories: small cell lung cancer (SCLC) and non-small 
cell lung cancer (NSCLC). Approximately 85% of total 
lung cancers are NSCLC that can be further classified into 
two major subtypes: squamous cell carcinoma (SCC) and 
adenocarcinoma (AD). Recent remarkable advances in 
identification of genetic alterations including p53, KRAS,
EGFR, HER2, c-MET, LKB1, PIK3CA, BRAF and 
EML4-ALK not only provided mechanistic understanding 
of tumor growth and metastatic advantages but also 
served as targets for diagnosis and therapy (2-12). 
Moreover, a recent report suggested that, among high-risk
individuals aged 55-74 years with a heavy smoking history,
computed tomography (CT) screening is associated with a 20%
reduction in lung cancer mortality compared with chest X-ray
screening (http://www.cancer.gov/images/dsmb-nlst.pdf).
Even with these recent advances, early screening of lung 
cancer is still inconsistent and the overall 5-year survival 
rate of lung cancer remains around 15% (13-15). 
Therefore, discovery of altered genes for early detection 
and as therapeutic targets is urgently needed to prolong 
the life of lung cancer patients. 

Similar to many other cancers, lung cancer is a complex and
heterogeneous genetic disease resulting from the accumulation of
genetic and epigenetic alterations during multiple steps of tumor
progression and metastasis (16,17). Depending on tumor subtype,
ethnicity, gender and exposure to carcinogens (e.g. smoking, radon
gas, asbestos and cooking oil fumes), accumulated alterations
exploit different tumorigenic mechanisms, resulting in aberrant
activation of oncogenic signaling pathways and uncontrolled tumor
growth and metastasis (18-20). To identify common cancer-associated
alterations, investigators have applied high-throughput genomic
technologies, including DNA sequencing and high-density
microarrays, to generate comprehensive genome-wide data sets of
somatic mutations, copy number alterations (CNAs), transcriptomes
and altered microRNA (miRNA) expression in lung cancer genomes
(21-33). In combination with clinical features, investigators were
able to independently identify profiles from these genomic
alterations for categorization of lung cancer subtypes,
identification of cancer genes for tumorigenic studies, diagnostic
and prognostic prediction of patient outcomes and development of
various strategies to improve patient care.

To facilitate identification of common alterations that might
harbor oncogenes or tumor suppressor genes in NSCLC, we curated
numerous genomic data sets and established the integrated database
IGDB.NSCLC. We graphically display multi-dimensional data on lung
cancer somatic alterations, including somatic mutations, CNAs,
aberrant expression of genes and miRNAs, chromosomal gain and loss
regions, regions of frequent loss of heterozygosity (LOH) and
experimental support for alterations retrieved from the lung cancer
literature. Altered gene expression associated with
clinicopathological features and patient outcomes is also included
to support biomarker development for clinical management. Although
lung cancer is the most common cause of cancer-related mortality
and cancer genomic studies are expanding rapidly, limited efforts
have been made to integrate these genomic resources into a coherent
and user-friendly presentation. The database HLungDB was reported
as a helpful resource for human lung cancer research, with
integrated and network analysis of lung cancer-related genes,
proteins and miRNAs extracted manually from the literature (34).
IGDB.NSCLC, however, not only provides lung cancer genes and miRNAs
from published reports but also analyzes various lung cancer
genomic data sets to illustrate somatic alterations simultaneously
at the genome, RNA, protein, function and application levels. These
simultaneous illustrations provide multiple layers of evidence that
allow investigators in the scientific and medical communities to
detect and prioritize commonly altered genes and miRNAs. Moreover,
we specifically emphasize concordant alterations in NSCLC genomes,
such as (i) up-regulated genes and miRNAs encoded in regions of
genomic amplification; (ii) down-regulated genes and miRNAs encoded
in regions of genomic deletion; (iii) mutated genes with concordant
genomic alterations; and (iv) genes significantly associated with
clinical information and encoded in regions of genomic alterations.
These altered genes and miRNAs, defined as significant genes and
miRNAs in NSCLC, are displayed graphically and in tables in
IGDB.NSCLC and should be prioritized for development as useful
targets to improve patient management.



DATABASE CONSTRUCTION 

To integrate various genomic resources for NSCLC, we collected
genomic data generated from primary tissue samples of lung AD and
SCC separately (Supplementary Table S1). For altered gene
expression measured on microarrays, we collected 15 data sets
comprising 1099 AD, 295 SCC and 189 normal tissue samples from five
different microarray platforms. An expression profiling data set
(GSE14814) of 90 NSCLC samples with and without
cisplatin/vinorelbine treatment was included to reveal potential
prognostic markers in NSCLC (35). We also included a reported
paired data set of 193 lung AD samples, with genomic alterations
measured on the Agilent 44K CGH array and expression alterations
measured on Affymetrix U133A and U133 2.0 arrays, to identify
concordant alterations in the same samples (36).

For processing Affymetrix gene expression data sets, the raw data
(CEL files) were normalized by MAS 5.0 using the affy package from
Bioconductor/R
(http://www.bioconductor.org/packages/release/bioc/html/affy.html)
(37). Quality control (QC) was performed with the simpleaffy and
affyQCReport packages
(http://www.bioconductor.org/packages/release/bioc/html/affyQCReport.html
and
http://www.bioconductor.org/packages/release/bioc/html/simpleaffy.html)
using three stringent QC criteria for inclusion of a microarray
sample: (i) the scaling factor of a given sample should be within
two standard deviations of the mean for the same array platform,
(ii) the present calls of a given sample should be >25% and
(iii) the 3'/5' GAPDH ratio of a given sample should be <3 (38).
After this QC process, 46 samples were eliminated (HG_U95A: 22
samples, HG-U133A: 2 samples and HG-U133_Plus_2: 22 samples). The
data sets from the same platform were normalized using the
normalize.quantiles function in R. Finally, all probe set intensity
values were log2-transformed. For data generated on two-color
microarrays (Agilent and Stanford data sets), expression profiles
were downloaded directly from the GEO website. The log10 ratios of
the Agilent data set were converted to log2 ratios. The
differentially expressed genes for each platform were ranked by
moderated t-statistics and selected at P < 0.01 under the
Bayesian-adjusted t-statistics from the linear models for
microarray data (limma) package (39).
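
As a rough illustration of the three QC criteria and the log10-to-log2
conversion described above, the following Python sketch filters the
samples of one array platform. It is not the authors' pipeline, which
relied on R/Bioconductor packages; the record fields (scaling_factor,
present_pct, gapdh_3to5), the function names and the choice of the
sample standard deviation are assumptions made for illustration only.

```python
import math
from statistics import mean, stdev


def passes_qc(sample, sf_mean, sf_sd):
    """Apply the three inclusion criteria to one sample record."""
    # (i) scaling factor within two standard deviations of the platform mean
    within_2sd = abs(sample["scaling_factor"] - sf_mean) <= 2 * sf_sd
    # (ii) present calls above 25%
    enough_present = sample["present_pct"] > 25.0
    # (iii) 3'/5' GAPDH ratio below 3
    gapdh_ok = sample["gapdh_3to5"] < 3.0
    return within_2sd and enough_present and gapdh_ok


def filter_platform(samples):
    """Keep the samples of a single array platform that pass all criteria.

    Assumption: the mean and (sample) standard deviation of the scaling
    factor are computed over all samples of that platform.
    """
    factors = [s["scaling_factor"] for s in samples]
    sf_mean, sf_sd = mean(factors), stdev(factors)
    return [s for s in samples if passes_qc(s, sf_mean, sf_sd)]


def log10_ratio_to_log2(value):
    """Convert a log10 ratio to a log2 ratio (change of logarithm base)."""
    return value / math.log10(2)
```

The base change uses the identity log2(x) = log10(x) / log10(2), so a
log10 ratio of 1 (a 10-fold change) becomes a log2 ratio of about 3.32.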

For detection of CNAs, we downloaded three data sets comprising 191
AD, 117 SCC and 271 normal tissue samples assayed on Affymetrix
GeneChip single nucleotide polymorphism (SNP) arrays and analyzed
them with dChip software (40). In brief, CEL format data are
normalized using the invariant set normalization algorithm, and
normalized within-chip intensity data are then generated based on a
reference data set of 50 normal individuals genotyped on the same
platform. Based on these signal values, the raw copy number for an
SNP in a sample is computed as: [log2(intensity of SNP/mean of
intensity of reference × 2)]. Median smoothing with a window size
of three SNPs is then applied to obtain the inferred copy number
(ICN) (41). We defined amplified regions as those with ICN >3 and
deleted regions as those with ICN <1 (42). For altered expression
of miRNAs, two data sets of 193 AD and 137 SCC tissue samples were
collected, and miRNA expression was analyzed by calculating the
mean (± standard deviation) of the log2 ratios of tumor/non-tumor
adjacent tissues. For analysis of array comparative genomic
hybridization (aCGH) data, we obtained two data sets of 70 SCC
tissue samples and analyzed them using the cghMCR package
(http://www.bioconductor.org/packages/2.4/bioc/html/cghMCR.html).
The gain and loss regions were calculated based on the SGOL
(Segment Gain Or Loss) scores by summing all the positive values
above a threshold of copy number (CN) >3 and all the negative
values below a threshold of CN <1.2, respectively. In addition, we
also integrated at least 1112 somatically mutated genes of NSCLC
from COSMIC and the literature, 214 lung cancer genes with
experimental evidence, 131 genes associated with
clinicopathological features of NSCLC and other genomic
alterations, such as LOH and minimum regions of alteration, from
the literature and our unpublished data.
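
The median smoothing and ICN thresholds described above can be
sketched as follows. This is a simplified illustration rather than the
dChip implementation: it assumes that raw copy numbers have already
been computed and are ordered along one chromosome, and the truncated
window at the chromosome ends, the constant names and the function
names are assumptions.

```python
from statistics import median

AMPLIFIED_ABOVE = 3.0  # ICN > 3 is called amplified
DELETED_BELOW = 1.0    # ICN < 1 is called deleted


def median_smooth(raw_cn, window=3):
    """Median-smooth raw copy numbers with a sliding window of SNPs.

    Assumption: at the two ends of the chromosome the window is
    truncated to the SNPs that exist.
    """
    half = window // 2
    smoothed = []
    for i in range(len(raw_cn)):
        lo = max(0, i - half)
        hi = min(len(raw_cn), i + half + 1)
        smoothed.append(median(raw_cn[lo:hi]))
    return smoothed


def call_cna(icn_values):
    """Label each SNP as amplified, deleted or neutral from its ICN."""
    calls = []
    for icn in icn_values:
        if icn > AMPLIFIED_ABOVE:
            calls.append("amplified")
        elif icn < DELETED_BELOW:
            calls.append("deleted")
        else:
            calls.append("neutral")
    return calls
```

With a window of three SNPs, a single outlying SNP is replaced by the
median of its neighborhood, so isolated spikes do not produce
amplification or deletion calls on their own.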

RESULTS 

Multiple levels of somatic alterations in NSCLC 

To facilitate identification of cancer-associated somatically
altered genes, we aimed to provide lines of evidence that allow
investigators to prioritize these genes for mechanistic studies and
clinical applications in NSCLC. The integrated framework of
IGDB.NSCLC is built on the physical map of the human genome
sequence from Ensembl, with various integrated somatic alterations
displayed alongside (Figure 1). Users can apply various search
terms (gene, marker, cytogenetic location and other keywords) to
display the altered data in lung AD, SCC and/or combined NSCLC.
Three major illustrations, namely chromosome view, gene view and
miRNA view, are provided for simultaneous display of concordant
alterations in the same interface. The chromosome view shows the
comprehensive, integrated alterations in relation to NSCLC genes
and regions, including evidence of experimental support, somatic
mutation, altered gene expression, CNAs, LOH and other chromosomal
alterations. We provide ERBB2, EGFR and MET as tutorial examples
(please see our quick examples of chromosome view). The gene view
presents detailed information on the aberrations of the selected
gene, including (i) mutation frequency and details of mutation
changes; (ii) frequency of altered gene expression supported by
experimental evidence such as immunohistochemistry (IHC), RT-PCR or
western blot analysis, and its association with clinical
information; (iii) the fold changes and the statistical
significance of altered expression of each probe set of the
selected gene on a given microarray platform; (iv) distribution
plots of the gene's altered expression across multiple NSCLC
tissues, shown as log2 ratios of tumor/non-tumor adjacent tissues;
and (v) the publication status of the aberrant gene in various
cancers as provided in the cancer gene index and the literature
(see our quick examples of gene view). The miRNA view likewise
provides details of the aberrant expression of a miRNA, similar to
the gene view, together with external links for predicting miRNA
target genes in MICROCOSM (43), TARGETSCAN (44), PICTAR-VERT (45)
and miRTarBase (46). We provide three miRNAs, hsa-let-7e,
hsa-mir-17 and hsa-mir-31, as tutorial examples (see our quick
examples of miRNA view).

Concordant somatic alterations in NSCLC 

To further reveal significant cancer-associated genes and miRNAs
involved in tumor progression of NSCLC, we used CNAs
(amplification: inferred copy number, ICN >3; deletion: ICN <1) as
the framework and searched for the concordant presence of (i) genes
with altered expression, (ii) miRNAs with altered expression,
(iii) genes with common somatic mutations and (iv) genes associated
with clinicopathological features in the literature. For genes with
altered expression located in CNA regions, we found 27 up-regulated
genes in amplified regions and seven down-regulated genes in
deleted regions in AD, and 23 up-regulated genes in amplified
regions and 13 down-regulated genes in deleted regions in SCC
(Figure 2 and Supplementary Table S2). Some of these genes were
previously identified as genes with altered expression in NSCLC,
providing validation, but the roles of the majority of these genes
in the tumorigenesis of NSCLC remain unknown. For miRNAs with
altered expression residing in CNA regions, we identified 21
up-regulated miRNAs in amplified regions and 19 down-regulated
miRNAs in deleted regions in AD, and 19 up-regulated miRNAs in
amplified regions and 20 down-regulated miRNAs in deleted regions
in SCC (Supplementary Figure S1 and Supplementary Table S3).
Similarly, some of these aberrant miRNAs were validated by previous
reports, but the majority are novel significant miRNAs in the tumor
formation of NSCLC.
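
The concordance search above amounts to testing whether a gene or
miRNA with altered expression falls inside a CNA region of the
matching direction. A minimal sketch follows, assuming simple
(chromosome, start, end) intervals and a single representative
position per gene; the gene names, coordinates and function names are
hypothetical and are not taken from IGDB.NSCLC.

```python
def in_any_region(chrom, pos, regions):
    """Return True if (chrom, pos) lies inside any (chrom, start, end) region."""
    return any(
        r_chrom == chrom and start <= pos <= end
        for r_chrom, start, end in regions
    )


def concordant_genes(genes, amplified, deleted):
    """Find up-regulated genes in amplified regions and down-regulated
    genes in deleted regions.

    genes: list of dicts with keys name, chrom, pos and direction,
    where direction is "up" or "down" (hypothetical record layout).
    """
    hits = []
    for gene in genes:
        location = (gene["chrom"], gene["pos"])
        if gene["direction"] == "up" and in_any_region(*location, amplified):
            hits.append((gene["name"], "up_in_amplified"))
        elif gene["direction"] == "down" and in_any_region(*location, deleted):
            hits.append((gene["name"], "down_in_deleted"))
    return hits


# Hypothetical example data (placeholder names and coordinates).
example_amplified = [("chr7", 100, 200)]
example_deleted = [("chr9", 50, 80)]
example_genes = [
    {"name": "GENE_A", "chrom": "chr7", "pos": 150, "direction": "up"},
    {"name": "GENE_B", "chrom": "chr7", "pos": 150, "direction": "down"},
    {"name": "GENE_C", "chrom": "chr9", "pos": 60, "direction": "down"},
    {"name": "GENE_D", "chrom": "chr9", "pos": 90, "direction": "down"},
]
example_hits = concordant_genes(example_genes, example_amplified, example_deleted)
```

In this toy example only GENE_A and GENE_C are concordant: GENE_B is
down-regulated inside an amplified region, and GENE_D lies outside the
deleted region.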

We also examined the concordance of genes with common somatic
mutations (mutated in at least five cases) that reside in the CNA
regions of NSCLC. A total of 1112 genes with somatic mutations,
including 386 in AD and 55 in SCC, were downloaded from the COSMIC
database (47). The majority of these somatic mutations occurred
only once in NSCLC tissues; only 43 genes in AD and seven genes in
SCC carried at least five independent somatic mutations.
Interestingly, the majority of these commonly mutated genes, 33/43
(76.7%) in AD and 6/7 (85.7%) in SCC, are co-localized with CNA
regions in NSCLC (Table 1). Our results further demonstrate that
genes with frequent somatic mutations are commonly associated with
chromosomal alterations involved in the tumorigenesis of NSCLC.

To further examine the link between clinicopathological features
and somatic alterations, we collected 309 papers reporting 125
genes in AD and 105 genes in SCC whose altered expression in NSCLC
was validated by experimental evidence such as IHC, RT-PCR and
western blot analysis. Among them, 62 (62/125, 49.6%) genes in AD
and 54 (54/105, 51.4%) genes in SCC, extracted from 159 papers,
were shown to be associated with 22 clinicopathological features
and to be located in the CNA regions

Figure 1. The framework of IGDB.NSCLC.

Figure 2. The graphic integration of CNAs with genes with altered expression in lung AD and SCC. The red lines represent amplified CNA regions
and up-regulated genes. The green lines represent deleted CNA regions and down-regulated genes. The lines are placed according to
their physical positions along the chromosomes.

Table 1. Common mutated genes in copy number alteration regions of NSCLC

Non-small cell   Alteration                   Genes with at least five somatic mutations
lung carcinoma

AD               Mutated genes in amplified   APC, BRAF, CDC42BPA, CDKN2A, CTNNB1, EGFR,
                 regions (30 genes)           EPHA3, EPHA5, EPHA7, EPHB6, FGFR4, FLT1,
                                              INSRR, JAK2, KDR, KIAA1804, KRAS, LMTK2,
                                              MET, NRAS, NTRK1, PAK7, PDGFRA, PIK3C3,
                                              PIK3CA, PIK3CG, PRKDC, RB1, ROBO2, TERT
AD               Mutated genes in deleted     NOTCH1, PTEN, TP53
                 regions
SCC              Mutated genes in amplified   BRAF, CDKN2A, EGFR, KRAS, PIK3CA
                 regions
SCC              Mutated genes in deleted     TP53
                 regions



(Supplementary Table S4). Our results suggest that NSCLC genes
associated with clinicopathological features commonly (~50%) reside
in CNA regions, reflecting the contribution of genetic alterations
to the pathological changes of tumor formation in NSCLC.



DISCUSSION 

To our knowledge, IGDB.NSCLC is the largest integration of lung
cancer genomic resources, providing multiple levels of evidence to
search for concordantly altered targets and to prioritize putative
NSCLC genes for future studies. In addition to a database with
various search options and user-friendly interfaces, we provide
concordant somatic alterations based on genome-wide CNA data
co-localized with altered gene expression, aberrant miRNA
expression, somatic mutations and genes associated with
clinicopathological features. The high concordance of the CNA data
with these somatic alterations and clinical features in IGDB.NSCLC
reflects the quality of the data integration with experimental
validation, the heterogeneity of genetic pathways leading to NSCLC
and the important roles of genomic alterations in tumor formation
of NSCLC. Future development of IGDB.NSCLC will integrate other
NSCLC resources, with a focus on expanding the data on
clinicopathological features for dissecting the altered genes
involved in tumor progression and on somatic alteration data from
next-generation sequencing. In conclusion, IGDB.NSCLC is an
invaluable resource for selecting putative cancer genes in NSCLC to
better understand the heterogeneous tumorigenic mechanisms and for
developing useful strategies in clinical applications to prolong
the life of lung cancer patients.



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Tables 1-4, Supplementary Figure 1.